
    A Lite Romanian BERT: ALR-BERT

    Large-scale pre-trained language representations and their promising performance on various downstream applications have become an area of interest in natural language processing (NLP). There has been considerable interest in further increasing model size in order to surpass the best previously reported results. However, beyond a certain point, increasing the number of parameters runs into the limited memory and compute capacity of available GPUs/TPUs. In addition, such models are mostly available either in English or as part of a shared multilingual model. Hence, in this paper we propose a lite BERT trained on a large corpus solely in the Romanian language, which we call “A Lite Romanian BERT (ALR-BERT)”. Based on comprehensive empirical results, ALR-BERT produces models that scale far better than the original Romanian BERT. Alongside presenting performance on downstream tasks, we detail the analysis of the training process and its parameters. We also intend to release our code and model as open source, together with the downstream tasks.
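    As an illustration of how such a pre-trained encoder is typically consumed, the sketch below loads a BERT-style model with the Hugging Face transformers library and produces contextual embeddings for a Romanian sentence. The model identifier "example-org/alr-bert" is a placeholder assumed for the example, not the actual ALR-BERT release path.

        # Minimal sketch: loading a Romanian BERT-style encoder with Hugging Face transformers.
        # "example-org/alr-bert" is a hypothetical identifier; substitute the released checkpoint.
        import torch
        from transformers import AutoTokenizer, AutoModel

        tokenizer = AutoTokenizer.from_pretrained("example-org/alr-bert")
        model = AutoModel.from_pretrained("example-org/alr-bert")

        inputs = tokenizer("Acesta este un exemplu de propoziție în limba română.",
                           return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)

        # outputs.last_hidden_state holds one contextual vector per sub-word token.
        print(outputs.last_hidden_state.shape)  # (1, num_tokens, hidden_size)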

    Natural Language Question Answering in Open Domains

    Abstract: With the ever-growing volume of information on the web, traditional search engines, which return hundreds or thousands of documents per query, place an increasing burden on users' patience in satisfying their information needs. Question Answering in Open Domains is a top research and development topic in current language technology. Unlike standard search engines, based on the latest Information Retrieval (IR) methods, open-domain question-answering systems are expected to deliver not a list of documents that might be relevant to the user's query, but a sentence or a paragraph answering the question asked in natural language. This paper reports on the construction and testing of a Question Answering (QA) system that builds on several web services developed at the Research Institute for Artificial Intelligence (ICIA/RACAI). The evaluation of the system was carried out independently by the organizers of the ResPubliQA 2009 exercise, which rated it the best-performing system, with the highest improvement over a baseline state-of-the-art IR system attributable to natural language processing technology. The system was trained on a specific corpus, but its functionality is independent of the linguistic register of the training data.
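    As a toy illustration of the passage-level idea contrasted with document retrieval above, the sketch below ranks candidate paragraphs by simple term overlap with the question and returns the single best paragraph. This is an assumed baseline for illustration only, not the ICIA/RACAI system described in the paper.

        # Toy sketch: return the best-matching paragraph instead of a list of documents.
        # The scoring is plain term overlap, far simpler than a real QA system.
        import re

        def terms(text):
            return set(re.findall(r"\w+", text.lower()))

        def best_paragraph(question, paragraphs):
            q = terms(question)
            # Score each paragraph by how many question terms it contains.
            scored = [(len(q & terms(p)), p) for p in paragraphs]
            return max(scored)[1]

        paragraphs = [
            "The institute was founded to advance research in artificial intelligence.",
            "ResPubliQA 2009 evaluated question answering systems over European legislation.",
        ]
        print(best_paragraph("What did ResPubliQA 2009 evaluate?", paragraphs))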

    Using a Large Set of EAGLES-compliant Morpho-Syntactic Descriptors as a Tagset for Probabilistic Tagging

    The paper presents one way of reconciling data sparseness with the requirement of high-accuracy tagging using fine-grained tagsets. For lexicon encoding, EAGLES elaborated a set of recommendations aimed at covering multilingual requirements, which therefore resulted in a large number of features and possible values. Such an encoding, used for tagging purposes, would lead to very large tagsets. For instance, our EAGLES-compliant lexicon required a set of about 1,000 morpho-syntactic description codes (MSDs), which, after accounting for some systematic syncretic phenomena, was reduced to a set of 614 MSDs. Building reliable language models (LMs) for this tagset would require an unrealistically large amount of hand-annotated/validated training data. Our solution was to design a hidden, reduced tagset and use it in building various LMs. The underlying tagger uses these LMs to tag a new text in as many variants as there are LMs. The tag differences between these variants are processed by a combiner, which chooses the most likely tags. In the end, the tagged text is subject to a conversion process that maps the tags from the reduced tagset onto the more informative tags of the large tagset. We describe this processing chain and provide a detailed evaluation of the results.
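    A hedged sketch of the tiered-tagging pipeline described above: several LM variants tag the same sentence with the reduced (hidden) tagset, a combiner takes a majority vote per token, and a final conversion step maps each reduced tag back to a full MSD via a lexicon-style lookup. All words, tags, and mappings below are invented for illustration; the real system uses the 614-MSD tagset and a lexicon-driven conversion.

        # Illustrative sketch of tiered tagging with several LMs and a combiner.
        from collections import Counter

        # Each LM variant proposes a reduced tag per token of the same sentence.
        variant_taggings = [
            ["Np", "V", "Nc"],   # tagging produced with LM 1
            ["Np", "V", "Nc"],   # tagging produced with LM 2
            ["Nc", "V", "Nc"],   # LM 3 disagrees on the first token
        ]

        def combine(variants):
            # Majority vote per token position over the reduced tags.
            return [Counter(tags).most_common(1)[0][0] for tags in zip(*variants)]

        # Conversion step: recover the full MSD from the word and its reduced tag,
        # as a lexicon lookup would do (values here are purely illustrative).
        reduced_to_msd = {
            ("Ion", "Np"): "Np",
            ("citește", "V"): "Vmip3s",
            ("cartea", "Nc"): "Ncfsry",
        }

        words = ["Ion", "citește", "cartea"]
        reduced = combine(variant_taggings)
        msds = [reduced_to_msd.get((word, tag), tag) for word, tag in zip(words, reduced)]
        print(list(zip(words, reduced, msds)))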